Abstract 

We have developed a combined feature based and shape based 
visual tracking system designed to enable a planetary rover to 
visually track and servo to specific points chosen by a user 
with centimeter precision. The feature based tracker uses in- 
variant feature detection and matching across a stereo pair, 
as well as matching pairs before and after robot movement 
in order to compute an incremental 6-DOF motion at each 
tracker update. This tracking method is subject to drift over 
time, which can be compensated by the shape based method. 
The shape based tracking method consists of 3D model regis- 
tration, which recovers 6-DOF motion given sufficient shape 
and proper initialization. By integrating complementary al- 
gorithms, the combined tracker leverages the efficiency and 
robustness of feature based methods with the precision and 
accuracy of model registration. In this paper, we present the 
algorithms and their integration into a combined visual track- 
ing system. 
1. Introduction 

Goal level, single cycle activity commanding for planetary 
rovers requires a high degree of robotic autonomy. The 2009 
Mars Science Laboratory (MSL) rover will be required to 
navigate to a scientifically relevant feature from a distance of 
10 meters away and place a contact instrument within 1 cm of 
the specified location. This capability will require the rover 
to track specified points with centimeter precision and servo 
directly to them. Because features are selected for scientific 
relevance, they are not necessarily those features which best 
facilitate appearance-based visual tracking. 

We have developed a combined feature based and shape based 
visual tracking system that leverages the benefits of each 
method in a complementary manner. 

0-7803-723 1 -XA3 I/S 10.00/® 2002 ieee 
IEEE AC paper # 1297 

In order to handle large errors in platform motion prediction, 
reduce tracking frequency, and handle targets which do not 
facilitate unambiguous appearance based matching, the sys- 
tem uses one tracker which makes use of invariant feature de- 
tection and matching. The feature detector finds large popu- 
lations of features around the feature of interest. By matching 
features across a stereo pair, as well as matching pairs before 
and after robot motion, the tracker can quickly compute a 6- 
DOF motion. In a static environment this 6-DOF transforma- 
tion describes the motion of the tracked point. RANSAC is 
used to provide robustness to errors during feature detection 
and matching. Because the recovered motion is incremental, 
compounding the transformations found by the tracker leads 
to target drift over time. This is compensated by making use 
of a shape based tracker. 

The shape based tracker employs 3D terrain model registra- 
tion based on nonlinear optimization. Model registration can 
provide a strong cue for recovering 6-DOF motion if the mod- 
els have sufficient shape and the optimization is initialized 
sufficiently close to the solution. By using the output of the 
feature based tracker to initialize the registration, we are able 
to align the current target view to the original view, thereby 
eliminating drift incurred by compounding the incremental 
motions recovered by the feature based tracker. 

The combined tracking system is capable of tracking user- 
specified points for robotic navigation with centimeter level 
accuracy over distances on the order of ten meters. This track- 
ing system is a critical component of the integrated single cy- 
cle instrument placement work demonstrated at NASA Ames 
Research Center. 

2. A Tale of Two Trackers 

Our robot navigation and instrument placement system uses 
two vision based tracking methods to provide fast, accurate 
incremental updates to the target location estimate during 
robot navigation as well as high precision error correction 
when the rover reaches an intended science target. Both of 
these tracking methods make the assumption that the target 
and the scene around it do not change, i.e. that the world is 
physically static. Lighting conditions may change, since our 
system may operate autonomously for a few hours. 

The first method is an appearance based method that uses the 
SIFT algorithm[1] at its core. The SIFT algorithm is used to 
reliably find interest points in the stereo camera images and 
to find putative matches between image pairs before and after 
motion. Stereo cameras provide 3D point locations for matched 
interest points in stereo views and matched 3D points are used 
to recover a rigid transformation aligning the points. Robust 
estimation makes the method tolerant of outliers and mis- 
matches. Ultimately each target might be tracked using hun- 
dreds or thousands of matched points, providing a high degree 
of redundancy and accuracy. 

Given two meshes generated from two different views, the 
second tracker method makes use of a virtual range sensor 
in order to determine depth at each correspondence point be- 
tween the two meshes. By minimizing the difference between 
the rendered depth at each point, we can extract a rigid trans- 
formation that aligns the two models, thereby allowing us to 
determine the coordinate transformation between views. 

Our combined tracker takes advantage of both of these meth- 
ods to provide an integrated visual tracking infrastructure 
which is fast and robust during a traverse, and can provide 
bounded error at the end of a target approach. 

Feature based tracker 

Many feature based trackers operate by matching a chosen 
template to an area of interest in successive images. The 
search is often done using an exhaustive correlation or con- 
volution, which can be expensive when precise predictions 
are not available or large camera motions must be tolerated. 
These trackers may offer the user the flexibility to specify 
the template, but the specified template may not be amenable 
to tracking due to low visual texture or changing appearance 
during motion. In addition, if the tracker only keeps track of 
one nominal target point, it is brittle in the event of a mis- 
match, and vulnerable to occlusions or changing viewpoints 
or other physical constraints. 

The appearance based tracking algorithm used in our system 
uses large numbers of image features matched across stereo 
pairs. Feature matching is done using the SIFT algorithm[1]. 
The SIFT algorithm consists of two steps. The first step is the 
extraction of interest points from an image. Interest points are 
local maxima in scale space, found by searching for points in 
a Laplacian image pyramid[2] with higher values than neigh- 
bors in x, y and the scale dimension. The interest operator 
used by SIFT is invariant to rotation, translation, and spatial 
scale[1]. Once these interest points are found, a local 
orientation is estimated. The local image patch is then used to 
compute a feature vector, or descriptor, which is computed using 
some edge statistics in the neighborhood of the interest point, 
where the neighborhood is defined by the location, orienta- 
tion, and scale recovered by the interest operator. This means 
that a large number of interest points are identified in image 
pairs under Euclidean or approximately Euclidean transfor- 
mations in the images. The descriptors also tend to be fairly 
robust, so that a nearest neighbor search in feature space tends 
to find a large number of matched points in two images. 


Our 3D SIFT based tracker uses features extracted from four 
images (two stereo pairs) to recover the incremental motion 
of the tracked target in the robot coordinate frame. We refer 
to one stereo image pair as $\{L_i, R_i\}$. From these images 
the SIFT algorithm extracts and matches features $\{l_i, r_i\}$ 
between the images, providing matched pairs of image points 
$z_i = (l_i^T, r_i^T)^T$. The 3D location $x_i$ corresponding to 
the matched pair $z_i$ is recovered through stereo, 

$$x_i = f(z_i) \qquad (1)$$

By arbitrary choice, the SIFT descriptor for the left image 
point $l_i$ is taken as the descriptor for the 3D point $x_i$. 

When the next image pair $\{L_{i+1}, R_{i+1}\}$ is acquired, matched 
features $z_{i+1} = (l_{i+1}^T, r_{i+1}^T)^T$ are found and 3D points 
$x_{i+1} = f(z_{i+1})$ are recovered. Using the SIFT descriptors 
from $x_i$ and $x_{i+1}$, putative matches can be found between the 
3D points extracted before and after robot motion. 
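
The putative matching step can be illustrated with a brute-force nearest-neighbor search over descriptors. This is a minimal sketch rather than the implementation used on the rover: the function name `match_descriptors`, the array layout, and the optional `max_dist` cutoff are assumptions introduced here for illustration.

```python
import numpy as np

def match_descriptors(desc_prev, desc_next, max_dist=np.inf):
    """Nearest-neighbor matching between two descriptor sets.

    desc_prev: (n_prev, d) descriptors attached to the 3D points x_i.
    desc_next: (n_next, d) descriptors attached to the 3D points x_{i+1}.
    max_dist:  hypothetical cutoff on descriptor distance (not from the paper).
    Returns an (m, 2) integer array of index pairs (prev, next).
    """
    # Pairwise Euclidean distances, shape (n_prev, n_next).
    diff = desc_prev[:, None, :] - desc_next[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # For every previous descriptor, the closest descriptor in the next set.
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(desc_prev)), nearest] <= max_dist
    return np.column_stack([np.nonzero(keep)[0], nearest[keep]])
```

Matches found this way are only putative; wrong pairs are expected and are left for the robust estimation step to reject.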

Once a set of putative 3D point matches is found, we estimate 
the 6-DOF transformation from one view to the next using 
Horn’s method[3] and RANSAC[4]. Horn’s method finds the 
least-squares rotation and translation between a set of 
matched points in closed form[3]. However, because Horn’s 
method minimizes a quadratic cost function, errors (outliers) 
in SIFT matching, either between image pairs or between the 
3D points, can cause arbitrarily large errors in the recovered 
transformation parameters. To identify and eliminate outliers 
we use the robust estimation algorithm RANSAC[4] to find the 
transformation that is consistent with the largest number of 
points, or inliers. 
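
As an illustration of the closed-form absolute orientation step, the sketch below solves the same least-squares problem with an SVD (the Kabsch/Arun formulation). This is a swapped-in technique: Horn’s closed-form solution is usually stated with unit quaternions, and both reach the same optimum. The name `fit_rigid` and the array conventions are assumptions made here.

```python
import numpy as np

def fit_rigid(a, b):
    """Least-squares rigid transform (R, t) with b ~= a @ R.T + t.

    a, b: (n, 3) arrays of matched 3D points, n >= 3.
    Uses the SVD-based solution, not Horn's quaternion derivation.
    """
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (a - ca).T @ (b - cb)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cb - R @ ca
    return R, t
```

Because the fit is a plain least-squares solution, a single grossly wrong match can pull the result arbitrarily far, which is why the estimate is wrapped in RANSAC.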

Inliers are defined as those putative matches 
$\{x_i^j, x_{i+1}^j\}$ such that 

$$\| x_{i+1}^j - T_i^{i+1} x_i^j \| < \tau \qquad (2)$$

where $\tau$ is a threshold. Currently we use $\tau = 3$ cm and 
run the RANSAC loop $M = 100$ times, which takes negligible 
time since Horn’s method is very fast. RANSAC returns 
the transformation with the largest consensus, and the list of 
matches in the consensus set. To further improve the estimate 
we use the consensus set to re-estimate the transform with all 
inliers. If the set of inliers changes we continue to re-estimate 
the transform until the consensus set converges. We denote 
the resulting transformation by $T_i^{i+1*}$ and the inlier set 
by $J$. The steps above are also shown in Figure 1. 

Once the rigid transformation is computed, the tracked 
feature location is simply updated by applying the 
transformation to the target location 

$$x_{i+1}^0 = T_i^{i+1*} x_i^0 \qquad (3)$$

Note that mismatches may occur in two different steps in the 
tracking algorithm. Mismatches between $l_i$ and $r_i$ will lead 
to erroneous coordinates for $x_i$. Mismatches between points 
$x_i^j$ and $x_{i+1}^j$ will lead to 3D point pairs which are not 
consistent with a single rigid body motion. Both of these kinds 
of outliers are handled by the robust absolute orientation. No 
explicit outlier rejection is needed in the 3D feature extraction 
prior to solving for absolute orientation. 

Figure 1. SIFT based tracker diagram 

Figure 2. Uncertainty in the 3D coordinates of the initial 
target selection due to stereo errors 

Uncertainty 

As the feature based tracker tracks a science target, two mea- 
sures are used to estimate the performance of the system. The 
first is the uncertainty in the target location represented by a 
3x3 covariance matrix over the XYZ location of the tracked 
feature, which is useful for geometric reasoning about the 
precision of the target location estimate for camera pointing 
and target handoff. The second is a single number represent- 
ing a qualitative, overall confidence measure for the tracker 
which is useful for planning and execution purposes and de- 
tecting tracking failures. 

The tracker uncertainty takes into consideration the initial 
target location specification as well as the compounded 
uncertainties of all of the tracker updates. The initial target 
location uncertainty is computed assuming a half-pixel standard 
deviation in the user specified location in the reference camera 
image, as well as an uncorrelated half-pixel standard deviation 
in the stereo disparity matching in the other stereo camera. 
The initial location and its covariance matrix are found 
by taking the unscented transform [5] of equation (1) above 
with $z_0 = (l_0^T, r_0^T)^T$ and 

$$P_{zz} = \sigma^2 I_{4 \times 4} \qquad (4)$$

with $\sigma = 1/2$ to yield the 3D location and 3x3 covariance 
matrix $P_{xx}$. 
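
The propagation of the pixel covariance through the stereo function can be sketched with a basic unscented transform. This is an illustrative version under stated assumptions: the sigma-point scheme with a single `kappa` parameter is the textbook form, and `triangulate` is a hypothetical rectified-stereo stand-in for $f$ whose focal length, baseline, and principal point are made-up values, since the paper does not specify the camera model.

```python
import numpy as np

def unscented_transform(f, mean, cov, kappa=0.0):
    """Propagate (mean, cov) through a nonlinear function f using 2n+1 sigma points."""
    mean = np.asarray(mean, dtype=float)
    n = len(mean)
    # Columns of the Cholesky factor give the sigma-point offsets.
    L = np.linalg.cholesky((n + kappa) * np.asarray(cov, dtype=float))
    points = np.vstack([mean, mean + L.T, mean - L.T])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    y = np.array([f(p) for p in points])
    y_mean = weights @ y
    dy = y - y_mean
    y_cov = (weights[:, None] * dy).T @ dy
    return y_mean, y_cov

def triangulate(z, focal_px=500.0, baseline_m=0.1, cx=320.0, cy=240.0):
    """Hypothetical rectified-stereo f(z) for z = (u_left, v_left, u_right, v_right)."""
    u_l, v_l, u_r, _ = z
    depth = focal_px * baseline_m / (u_l - u_r)
    return np.array([(u_l - cx) * depth / focal_px,
                     (v_l - cy) * depth / focal_px,
                     depth])

# Half-pixel standard deviation on all four image coordinates, as in equation (4).
z0 = np.array([400.0, 240.0, 380.0, 240.0])
x0, P_xx = unscented_transform(triangulate, z0, 0.25 * np.eye(4))
```

With this toy geometry the variance along the depth axis dominates, which matches the elongated uncertainty region that stereo errors typically produce.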

At each tracker update, the RANSAC method above is used 
to find the set of inlying matches that can be used to compute 
the dominant rigid transformation. However, Horn’s method 
returns only a point estimate, without any information about 
the uncertainty in the estimate. In order to compute the co- 
variance of the estimator, we use bootstrap [6]. Analytic ap- 
proaches to propagating uncertainties through Jacobians of 
the norm minimized by Horn’s method do exist[7], but the 
non-parametric approach we use is theoretically sound, triv- 
ial to implement, and makes significant reuse of the estimator 
code already implemented. 

Bootstrap is a Monte Carlo method. To compute a bootstrap 
estimate of the covariance of the absolute orientation 
estimate, we generate a population of matched point sets from the 
inlier set $J$ by sampling with replacement to yield $B$ 
bootstrap sets $J^b$. For each set of matches $J^b$, we compute the 
transform $T^b$ using Horn’s method and recover the transform 
parameters $\theta^b$. Our current implementation recovers 
translation and Euler angles$^1$, but other rotation representations 
are equally feasible. From the population of estimates $\theta^b$, an 
empirical covariance is computed, 

$$P_{\theta\theta} = \frac{1}{B} \sum_{b=1}^{B} (\theta^b - \theta^*)(\theta^b - \theta^*)^T \qquad (5)$$

where $\theta^*$ is the vector of transform parameters corresponding 
to the optimal estimate $T_i^{i+1*}$. The covariance matrix for the 
updated location of the feature is then given by an unscented 
transform on the update 

$$P_{xx}^{i+1} = [\ldots] \qquad (6)$$

with $x_i^0$ and $P_{xx}$ from the previous tracker update, and 
$\theta^*$ and $P_{\theta\theta}$ from Horn’s method and bootstrap. The 
notation $T(\theta)$ indicates the rigid transformation 
parameterized by the rotation and translation parameters $\theta$. 
Note that because tracking is done in an incremental fashion, the 
covariance $P_{xx}$ is monotonically growing during a tracking run, 
i.e. the tracker accurately models the fact that incremental 
updates with small errors will compound into a larger drift 
over time. Our system typically has incremental errors on 
the order of millimeters and milliradians per tracker update, 
so that a single target approach accumulates only centimeters 
of error. The 3D model registration step described below is 
designed to recover from this potential drift. 
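
The bootstrap covariance estimate can be sketched as follows. This is an illustration under assumptions: `bootstrap_covariance` and its `estimator` callback are hypothetical names, the default `B=200` is an arbitrary choice because the paper does not state the number of bootstrap sets, and the deviations are taken about the optimal estimate as in equation (5).

```python
import numpy as np

def bootstrap_covariance(pts_a, pts_b, estimator, theta_star, B=200, rng=None):
    """Empirical covariance of an estimator under resampling of the inlier matches.

    pts_a, pts_b: (n, 3) matched inlier points before and after motion.
    estimator:    callable (pts_a, pts_b) -> parameter vector theta
                  (e.g. a rigid fit returning translation and Euler angles).
    theta_star:   parameters of the optimal estimate from the full inlier set.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta_star = np.asarray(theta_star, dtype=float)
    n = len(pts_a)
    thetas = np.empty((B, len(theta_star)))
    for b in range(B):
        # Draw n matches with replacement to form one bootstrap set.
        idx = rng.integers(0, n, size=n)
        thetas[b] = estimator(pts_a[idx], pts_b[idx])
    d = thetas - theta_star
    return d.T @ d / B
```

The appeal of this approach is visible in the signature: the estimator is reused unchanged, and no Jacobians of the orientation solution are needed.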

1 Euler angles can present problems due to singularities, and may not be 
amenable to representation by a Gaussian (mean and covariance). However, 
this work is applied to a surface rover with limited roll and pitch angles, 
avoiding the singularities in the representation, and the absolute orientation 
estimates tend to be highly overconstrained and yield very small covariance 
matrices, so the representation only needs to be accurate over a small 
neighborhood of the parameter space. 

Figure 3. Tracker result with confidence interval shown as 
an ellipse around the tracked point. The tracker was 
initialized with the upper left corner of a fiducial on a rock. 
Note that the fiducial was not explicitly used in the tracking, 
but placed to facilitate performance evaluation. 

In addition to the geometric uncertainty in target location 
represented by the covariance matrix $P_{xx}$, the tracker maintains 
a single confidence value as a qualitative measure of how 
well the target is being tracked. This number is computed 
as a simple function of the number of inliers $|J|$ found by 
RANSAC above; that is, the confidence $C$ is given by 

$$C = \frac{1}{1 + e^{-a(|J| - N)}} \qquad (7)$$

where through experimentation we have set $N = 30$ and 
$a = 0.1$. This confidence measure reflects the fact that if 
fewer than $N$ inliers are found, then the tracker confidence 
should be low, while if significantly more inliers are found 
then the tracker confidence should be high. 
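
The confidence measure of equation (7) is a logistic function of the inlier count. The short sketch below assumes the exponent is negative for inlier counts above $N$, which is the sign required for the confidence to rise with the number of inliers; the function and argument names are illustrative.

```python
import math

def tracker_confidence(num_inliers, n_mid=30, a=0.1):
    """Logistic confidence in (0, 1): 0.5 at n_mid inliers, approaching 1 for many."""
    return 1.0 / (1.0 + math.exp(-a * (num_inliers - n_mid)))
```

With these settings the value is exactly 0.5 at 30 inliers, falls below 0.1 for a handful of inliers, and saturates near 1 when hundreds of matches are found.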

These confidence measures are somewhat ad hoc, but are only 
intended to capture some useful qualitative information about 
tracker performance. The confidence is used by the onboard 
executive to determine when the risk of losing a target is high 
enough to warrant a change in activity, e.g. to approach a dif- 
ferent target with higher expected utility. In our experiments 
the tracker tends to either find a large number (hundreds) of 
matches or very few, and the overall system typically does 
what the rover operators would expect by identifying track- 
ing failures and aborting when necessary. 

Starting with the target location $x_i^0$, covariance matrix $P_{xx}$, 
point locations $x_i$ and descriptors $d_i$ from the previous time 
step, the tracker update proceeds as follows: 

1. Find matching SIFT features $l_{i+1}$ and $r_{i+1}$ 
2. Recover point locations $x_{i+1}$ 
3. Find putative matches between $x_i$ and $x_{i+1}$ 
4. Repeat $M$ times: 
   (a) Choose 3 putative matches at random 
   (b) Find rigid transformation $T_i^{i+1}$ 
   (c) Find the number of inliers (consensus) 
5. For the best consensus set $J$, 
   (a) Compute $\theta^*$ using matches $J$ 
   (b) Find inliers $J$ under transform $T(\theta^*)$ 
   (c) If inlier set changes, repeat 
6. Compute confidence $C$ based on $|J|$ using (7) 
7. Compute $P_{\theta\theta}$ using Bootstrap 
8. Compute $x_{i+1}^0$ and $P_{xx}^{i+1}$ using (6) 
9. Return $x_{i+1}^0$, $P_{xx}^{i+1}$ and $C$ 
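
Steps 4 and 5 of the update can be sketched as a RANSAC loop around a rigid fit, followed by re-estimation until the consensus set stops changing. This is a minimal illustration, not the flight code: the rigid fit uses an SVD solution in place of Horn’s quaternion form, the defaults mirror the stated 3 cm threshold and 100 iterations, and `max_refits` is a hypothetical safety cap that does not appear in the paper.

```python
import numpy as np

def fit_rigid(a, b):
    """Least-squares (R, t) with b ~= a @ R.T + t (SVD-based stand-in for Horn's method)."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    U, _, Vt = np.linalg.svd((a - ca).T @ (b - cb))
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cb - R @ ca

def inlier_mask(a, b, R, t, tau):
    """Matches whose residual after the transform is below tau (equation (2))."""
    return np.linalg.norm(b - (a @ R.T + t), axis=1) < tau

def ransac_rigid(a, b, tau=0.03, M=100, max_refits=20, rng=None):
    """Return (R, t, mask) for the largest consensus set, or None if it has < 3 points."""
    rng = np.random.default_rng() if rng is None else rng
    best = np.zeros(len(a), dtype=bool)
    for _ in range(M):
        # Step 4: hypothesize a transform from 3 random putative matches.
        pick = rng.choice(len(a), size=3, replace=False)
        R, t = fit_rigid(a[pick], b[pick])
        mask = inlier_mask(a, b, R, t, tau)
        if mask.sum() > best.sum():
            best = mask
    if best.sum() < 3:
        return None
    # Step 5: re-estimate from all inliers until the consensus set converges.
    for _ in range(max_refits):
        R, t = fit_rigid(a[best], b[best])
        mask = inlier_mask(a, b, R, t, tau)
        if mask.sum() < 3 or np.array_equal(mask, best):
            break
        best = mask
    R, t = fit_rigid(a[best], b[best])
    return R, t, best
```

The size of the returned mask is the $|J|$ that feeds the confidence measure, and the inlier points selected by the mask are the ones resampled by the bootstrap step.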



Figure 4. Each pixel in the range image is predicted by 
rendering the corresponding mesh facet into a virtual range 
sensor. 


3D shape based tracker 

The 3D shape based tracker uses terrain model registration to 
recover 6-DOF motion from stereo cameras. Tracking is per- 
formed by registering successively acquired terrain models of 
the target area to the initially acquired model of the target. By 
using an initial target template throughout the tracking cycle, 
successful registration to the current view at each step 
provides an estimate of the goal location that does not drift 
over time. 

For every pixel in the left camera image for which a corre- 
spondence is found in the right camera image, our stereo al- 
gorithm estimates the depth to that point. These depth esti- 
mates are combined to produce a 3D model of the surface. If 
two models of a surface are made from different locations, the 
rigid transformation that aligns the two models can be used to 
determine the coordinate transformation between views. 

The surface models are represented by triangulated meshes 
with vertices $v$ and $v'$. If the two 3D models contain some 
region of overlap, there is a rigid transformation that aligns 
the overlapping regions. We represent the rigid transformation 
using the parameter vector $p = (x, y, z, \alpha, \beta, \gamma)^T$ 
corresponding to 3 translational and 3 rotational degrees of 
freedom. These parameters define a transformation matrix $T_p$. 
If $p$ is the parameter vector describing the transformation 
between surfaces $v$ and $v'$, then for every pair of corresponding 
points $v_j$ and $v'_j$ the relationship 

$$v'_j - T_p v_j = 0 \qquad (8)$$

holds. With real observations this equality will not hold 
exactly. 
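
To make the parameterization concrete, the sketch below builds a homogeneous matrix $T_p$ from the six parameters and evaluates the residual of equation (8). The rotation convention is an assumption, since the paper does not state one: here $\alpha$, $\beta$, $\gamma$ are taken as rotations about the x, y, and z axes, composed as $R_z R_y R_x$. Function names are illustrative.

```python
import numpy as np

def transform_from_params(p):
    """4x4 rigid transform from p = (x, y, z, alpha, beta, gamma).

    Assumed convention: R = Rz(gamma) @ Ry(beta) @ Rx(alpha), angles in radians.
    """
    x, y, z, alpha, beta, gamma = p
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]])
    Ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    Rz = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

def registration_residuals(p, v, v_prime):
    """Per-point residual v'_j - T_p v_j for (n, 3) arrays of corresponding points."""
    T = transform_from_params(p)
    return v_prime - (v @ T[:3, :3].T + T[:3, 3])
```

For noise-free corresponding points generated with the same $p$, the residuals vanish; with real stereo data they do not, which motivates the robust objective used for registration.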

Our mesh registration approach projects these two models 
into a virtual range sensor view and minimizes the difference 
between the rendered depths at each point. The rendering 
takes O(n) operations, where n is the number of pixels in the 
virtual range sensor. For each triangle on the mesh $v'$, the 
vertices $v'_i$, $v'_j$, and $v'_k$ are projected onto the image plane. 
For every pixel inside that triangle, the location of the 
intersection of the camera ray $c_i$ and the facet of the mesh is a 
point $s'_i$, given by 

$$s'_i = \alpha_i v'_i + \alpha_j v'_j + \alpha_k v'_k \qquad (9)$$

with $\alpha_i + \alpha_j + \alpha_k = 1$. The depth to the intersection 
point is the z coordinate in the camera frame, 

$$z_i = n_c \cdot s'_i \qquad (10)$$

The vector of all depths $z_i$ is denoted $z$. The surface model 
$v'$ does not move during registration, so $z$ is a constant. 

The depth to the corresponding point on the surface $v$ changes 
with the transformation $p$: 

$$s_i = T_p(\alpha_i v_i + \alpha_j v_j + \alpha_k v_k)$$
$$h_i(p) = n_c \cdot s_i \qquad (11)$$

We define a robust objective function which is the sum of the 
absolute deviations between the projected depths: 

$$J(p) = \sum_i | h_i(p) - z_i | \qquad (12)$$

Because $J(p)$ has local minima, we first perform a 
coarse correlation search in order to narrow down the 
location of the best solution. Our initial estimate of $p$, 
denoted $p_0$, comes from the stereo SIFT-based tracker described 
earlier. Consider that $p$ is decomposed into a rotational 
component $r$ and a translational component $t$. Furthermore, 
consider that $t$ is decomposed into: 

$$t = x c_x + y c_y + z c_z \qquad (13)$$

where $c_x$ and $c_y$ are in the plane of the virtual range 
sensor, and $c_z$ is the pointing direction of the sensor. Because a 
search over the 6 dimensions of $p$ is expensive, we make a 
few approximations. 

1. For small changes in $t$, 
   $h_i(x, y, z + \Delta z, r) \approx h_i(x, y, z, r) + \Delta z$. 
   In other words, a change in transformation along the z-axis 
   of the virtual range sensor by some distance $\Delta z$ changes 
   $h_i$ by approximately the same amount. 
2. Our initial estimate of $r$ is approximately correct. 

These two approximations allow us to perform the correlation 
search across only two dimensions; the x-axis and y-axis of 
the virtual range sensor. 

For every $\Delta x$ and $\Delta y$ searched, the transformation $p$ 
is computed by translating the initial estimate $p_0$ by $\Delta x$ 
and $\Delta y$ along the x-axis and y-axis of the virtual range 
sensor. Because $J$ is a sum of absolute deviations, the correction 
$\Delta z$ to $z_0$ which minimizes the objective function 
$J(x_0 + \Delta x, y_0 + \Delta y, z_0 + \Delta z, r_0)$ is the median 
of the depth differences: 

$$\Delta z = \operatorname{median}_i \big( z_i - h_i(x_0 + \Delta x, y_0 + \Delta y, z_0, r_0) \big) \qquad (14)$$
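
The coarse search can be sketched as a grid search over the in-plane offsets, with the depth offset solved in closed form by a median at each grid point. This is an illustration under assumptions: `render_depths` is a hypothetical callback standing in for the virtual range sensor, returning the rendered depths $h$ for a given $(\Delta x, \Delta y)$ with the rotation held at its initial estimate, and pixels without data are assumed to be marked as NaN.

```python
import numpy as np

def coarse_search(render_depths, z, offsets):
    """Grid search over (dx, dy); the best dz per grid point is a median.

    render_depths: callable (dx, dy) -> array of rendered depths h, same shape as z.
    z:             reference depths from the fixed model (NaN where undefined).
    offsets:       1D iterable of candidate offsets, used for both dx and dy.
    Returns (cost, dx, dy, dz) for the lowest sum of absolute deviations, or None.
    """
    best = None
    for dx in offsets:
        for dy in offsets:
            r = z - render_depths(dx, dy)
            valid = np.isfinite(r)
            if not valid.any():
                continue
            # The median of the residuals minimizes sum |h + dz - z| over dz.
            dz = np.median(r[valid])
            cost = np.abs(r[valid] - dz).sum()
            if best is None or cost < best[0]:
                best = (cost, dx, dy, dz)
    return best
```

The result only narrows down the basin of the solution; it is intended as the starting point for a local optimization over all six parameters.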

As described above, the correlation search uses approximate 
knowledge of the three orientation parameters to search only 
over the sensor x and y coordinates, solving for the median 
difference in z. Once the correlation search finds an 
approximate solution, we optimize over all 6 rigid transformation 
parameters using Nelder-Mead[8], which is a general local 
nonlinear optimization method. Nelder-Mead only requires a cost 
function, not any derivative information, so the cost function 
in equation (12) is used directly. In order to avoid problems 
with early termination[8], we restart the Nelder-Mead 
optimization twice after it converges. Figure 5 shows an example 
result of the depth error after Nelder-Mead converges. 

Figure 5. Registration result: (a) hazcam image (b) range 
image (c) depth error after range image correlation (d) depth 
error after nonlinear optimization 

3. Results 

We ran the combined tracker through a simple test scenario 
on the K9 rover[9] in the NASA Ames Marscape. The test 
scenario was repeated over the course of September 22nd and 
23rd, 2004. In the scenario, an operator selects three tar- 
gets such that the straight-line distance from the rover’s arm 
workspace at the starting position to the targets are approxi- 
mately 5, 7.5, and 10 meters. The rover is then commanded to 
navigate to each of the targets in turn, while tracking their lo- 
cation with the combined tracker. The rover avoids obstacles 
using the CLARAty[10] navigator package, which is based 
on the Morphin algorithm[11]. 

After the rover arrives at each of the rocks, it is commanded 
to move its arm such that the CHAMP[12] camera contacts 
the rock as close as possible to the tracked target location. 
The instrument placement code analyzes the scene before the 
arm is moved, to determine the closest point on the rock that 
is safe for the CHAMP to touch, and plans a collision-free 
path for the arm. 

Tables 1 and 3 show the results for these two days of testing. 
For each target, we record the elapsed time of the traverse 
(which can be large if several obstacles have to be driven 
around), the accuracy of the target as tracked by the feature 
based tracker relative to 3-D models generated by the same 
camera pair as is used in the tracking, and the accuracy of the 
3-D shape-based tracker used for handoff from one pair of 
cameras to another. The placement accuracy is also recorded, 
though the placement error can be arbitrarily large since the 
system places a higher priority on safety than on accuracy of 
placement. Placement figures are not available for 
September 22nd; a motor failure in one of the arm joints prevented 
successful placement. 

The only failure in tracking occurred in the feature based 
tracker for the second rock on September 23rd. In this case, 
the tracker failed just as the rover approached the rock and in- 
troduced a large cast shadow into the scene. Once the tracker 
was unable to find a transformation between subsequent im- 
ages, it stopped updating the target location and fell back to 
dead reckoning. After the navigation was finished, the shape- 
based tracker was able to recover the target with accuracy 
comparable to the other experiments. 


Target                  1 (5m)      2 (7.5m)    3 (10m) 
Time to reach target    21 mins     +42 mins    +17 mins 
Tracker accuracy        0.68 cm     0.29 cm     1.3 cm 
Hand-off accuracy       0.5 cm      2.7 cm      1.7 cm 
Placement accuracy      N/A         N/A         N/A 

Table 1. 9/22/2004 Performance 



Table 2. 9/22/2004 Tracker 


Target                  1 (5m)      2 (7.5m)    3 (10m) 
Time to reach target    25 mins     +27 mins    +23 mins 
Tracker accuracy        ~0.3 cm     failed      1.7 cm 
Hand-off accuracy       1.3 cm      ~1.6 cm     3.2 cm 
Placement accuracy      ~6.3 cm     ~11 cm      ~3 cm 

Table 3. 9/23/2004 Performance 


4. Conclusions 

We started this work in an effort to increase the reliability of 
our previous system, which was based largely on the shape 
based method alone. We found that the shape based method 
was quite reliable so long as the initial estimate of the 
target location was not incorrect by more than approximately 
half the width of the rock being tracked. Unfortunately, dead 
reckoning errors often pushed our initial estimates beyond 
this tolerance. 

Table 4. 9/23/2004 Tracker 

Once we started developing the feature tracker based on SIFT, 
we found that it was reliable enough to use as the primary 
tracker in our navigation system, because the cumulative er- 
ror over a traverse of less than 10 meters was typically well 
within the half-rock tolerance that the shape based tracker 
generally requires. We decided therefore to use the shape 
based tracker only as the last step, to hand the target off from 
the long-range cameras used for approach to the front cam- 
eras used for manipulation and instrument placement. Using 
the shape based tracker as the last step ensures that the rover 
is indeed using the same point on the designated rock for in- 
strument placement as was initially chosen by the operators, 
since this point is chosen relative to another 3-D mesh of the 
rock, rather than relative to the rover or to another arbitrary 
coordinate frame. Since this change in usage, we have found 
the system to be quite reliable. The two components have 
complementary strengths that yield a robust tracking system. 

Since the experiments outlined in section 3, we have demon- 
strated the system operating several times, often tracking as 
many as five targets as the rover moves. To this point, we have 
executed at least one run where the rover has navigated to five 
targets in turn, and placed the CHAMP on each of the rocks 
with very little tracking error. In some instances the feature 
based tracker has lost the target due to occlusions, and was 
able to reacquire the target after the occlusion was removed. 
The fact that the tracker is able to provide a confidence mea- 
sure allows the rover to fall back to dead reckoning if the 
confidence drops, and allows the rover’s executive to change 
the course of action entirely if a target is lost. The tracker 
performs so well, however, that we typically have to intro- 
duce failures into the system in order to test the ability of the 
executive to cope with tracking failures. 